In [53]:
import graphlab
import graphlab as gl
In [2]:
sales = graphlab.SFrame('home_data.gl/')
In [3]:
sales
Out[3]:
The house price is correlated with the number of square feet of living space.
In [3]:
graphlab.canvas.set_target('ipynb')
sales.show(view="Scatter Plot", x="sqft_living", y="price")
Split data into training and testing.
We use seed=0 so that everyone running this notebook gets the same results. In practice, you may set a random seed (or let GraphLab Create pick a random seed for you).
In [54]:
train_data,test_data = sales.random_split(.8,seed=0)
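To see why a fixed seed gives everyone the same split, here is a minimal pure-Python sketch of a seeded random split. It only illustrates the idea and is not GraphLab Create's implementation; the helper `seeded_split` and its row-by-row coin flip are assumptions made for this example.

```python
import random

def seeded_split(rows, fraction, seed):
    """Send each row to the first list with probability `fraction`.

    Hypothetical helper for illustration only; not part of GraphLab Create.
    """
    rng = random.Random(seed)  # private generator, so global random state is untouched
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second

rows = list(range(1000))
train_a, test_a = seeded_split(rows, 0.8, seed=0)
train_b, test_b = seeded_split(rows, 0.8, seed=0)

# Same seed, same split; the sizes are only approximately 80/20.
print(train_a == train_b)
print(len(train_a) + len(test_a) == len(rows))
```

Calling the helper twice with the same seed returns identical lists, which is the property the notebook relies on when it passes seed=0.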
In [8]:
sqft_model = graphlab.linear_regression.create(train_data, target='price', features=['sqft_living'])
In [8]:
print test_data['price'].mean()
In [9]:
print sqft_model.evaluate(test_data)
RMSE of about \$255,170!
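RMSE stands for root mean squared error: the square root of the average squared difference between actual and predicted prices. A minimal sketch of that formula follows; the function name `rmse` and the three prices are made up for illustration and do not come from the dataset.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / float(len(squared_errors)))

# Made-up prices, for illustration only (not taken from home_data.gl).
actual = [300000.0, 450000.0, 600000.0]
predicted = [310000.0, 430000.0, 650000.0]
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, the one prediction that is off by 50,000 dominates the result.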
Matplotlib is a Python plotting library. You can install it with:
'pip install matplotlib'
In [78]:
import matplotlib.pyplot as plt
%matplotlib inline
In [9]:
plt.plot(test_data['sqft_living'], test_data['price'], '.',
         test_data['sqft_living'], sqft_model.predict(test_data), '-')
Out[9]:
Above: blue dots are original data, green line is the prediction from the simple regression.
Below: we can view the learned regression coefficients.
In [12]:
sqft_model.get('coefficients')
Out[12]:
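With a single feature, the two coefficients are an intercept and a slope. As a rough sketch of where such numbers come from, the closed-form least-squares fit is shown below. GraphLab Create may use a different solver and settings, so its coefficients need not match this formula exactly; `fit_simple_regression` and the data points are made up for this example.

```python
def fit_simple_regression(x, y):
    """Closed-form least-squares fit of y = intercept + slope * x."""
    n = float(len(x))
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    variance = sum((xi - mean_x) ** 2 for xi in x)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up points lying exactly on price = 50000 + 200 * sqft.
sqft = [1000.0, 1500.0, 2000.0, 2500.0]
price = [50000.0 + 200.0 * s for s in sqft]
intercept, slope = fit_simple_regression(sqft, price)
print("intercept = {0}, slope = {1}".format(intercept, slope))
```

The slope reads as the predicted change in price per additional square foot of living space.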
In [50]:
my_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']
In [11]:
sales[my_features].show()
In [12]:
sales.show(view='BoxWhisker Plot', x='zipcode', y='price')
Pull the bar at the bottom to view more of the data.
98039 is the most expensive zip code.
In [75]:
my_features_model = graphlab.linear_regression.create(train_data,target='price',features=my_features)
In [19]:
print my_features
In [20]:
print sqft_model.evaluate(test_data)
print my_features_model.evaluate(test_data)
The RMSE goes down from \$255,170 to \$179,508 with more features.
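The size of that improvement can be worked out directly from the two RMSE values quoted above. The variable names are chosen for this example only.

```python
rmse_one_feature = 255170.0    # sqft_model, as quoted above
rmse_more_features = 179508.0  # my_features_model, as quoted above

absolute_drop = rmse_one_feature - rmse_more_features
relative_drop = absolute_drop / rmse_one_feature
print("RMSE drops by ${0:,.0f} ({1:.1%})".format(absolute_drop, relative_drop))
```

That is a drop of \$75,662, or roughly 30% of the single-feature error.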
The first house we will use is considered an "average" house in Seattle.
In [85]:
house1 = sales[sales['id']=='5309101200']
In [86]:
print house1
In [57]:
print house1['price']
In [58]:
print sqft_model.predict(house1)
In [59]:
print my_features_model.predict(house1)
In this case, the model with more features provides a worse prediction than the simpler model with only 1 feature. However, on average, the model with more features is better.
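A model can lose on one house and still win on average. The sketch below shows this with four made-up houses; none of the numbers come from the dataset, and `rmse` is a small helper defined here for illustration.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / float(len(squared)))

# All values below are invented for illustration.
actual = [620000.0, 400000.0, 900000.0, 250000.0]
simple_pred = [625000.0, 300000.0, 700000.0, 350000.0]  # stands in for a 1-feature model
rich_pred = [640000.0, 390000.0, 880000.0, 260000.0]    # stands in for a multi-feature model

# On the first house the simple model is closer...
simple_error_first = abs(actual[0] - simple_pred[0])
rich_error_first = abs(actual[0] - rich_pred[0])
print(simple_error_first < rich_error_first)

# ...but over all four houses the richer model has the lower RMSE.
print(rmse(actual, rich_pred) < rmse(actual, simple_pred))
```

Both comparisons print True: a single prediction says little about overall quality, which is why the models are compared with RMSE on the full test set.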
In [87]:
house2 = sales[sales['id']=='1925069082']
In [88]:
print house2
In [63]:
print sqft_model.predict(house2)
In [64]:
print my_features_model.predict(house2)
In this case, the model with more features provides a better prediction. This behavior is expected here, because this house is more differentiated by features that go beyond its square feet of living space, especially the fact that it's a waterfront house.
In [65]:
bill_gates = {'bedrooms':[8],
'bathrooms':[25],
'sqft_living':[50000],
'sqft_lot':[225000],
'floors':[4],
'zipcode':['98039'],
'condition':[10],
'grade':[10],
'waterfront':[1],
'view':[4],
'sqft_above':[37500],
'sqft_basement':[12500],
'yr_built':[1994],
'yr_renovated':[2010],
'lat':[47.627606],
'long':[-122.242054],
'sqft_living15':[5000],
'sqft_lot15':[40000]}
In [89]:
print my_features_model.predict(graphlab.SFrame(bill_gates))
The model predicts a price of over \$13M for this house! But we expect the house to cost much more. (There are very few samples in the dataset of houses that are this fancy, so we don't expect the model to capture a perfect prediction here.)
In [14]:
hiZipcode = sales[sales['zipcode'] == '98039']
In [15]:
print hiZipcode
In [19]:
hiZipcode['price'].mean()
Out[19]:
In [41]:
## Houses with 2000 <= sqft_living <= 4000
myHouses = sales[(sales['sqft_living'] >= 2000) & (sales['sqft_living'] <= 4000)]
In [42]:
myHouses.show(view='BoxWhisker Plot', x='price', y='sqft_living')
In [100]:
myhouses_count = len(myHouses['id'])
allhouses_count = len(sales['id'])
print myhouses_count
print allhouses_count
In [103]:
## Fraction of all houses that fall in the 2000-4000 sqft range
print myhouses_count / float(allhouses_count)
In [46]:
advanced_features = [
'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode',
'condition', # condition of house
'grade', # measure of quality of construction
'waterfront', # waterfront property
'view', # type of view
'sqft_above', # square feet above ground
'sqft_basement', # square feet in basement
'yr_built', # the year built
'yr_renovated', # the year renovated
'lat', 'long', # the lat-long of the parcel
'sqft_living15', # average sq.ft. of 15 nearest neighbors
'sqft_lot15', # average lot size of 15 nearest neighbors
]
In [51]:
## Show
sales[advanced_features].show()
In [69]:
# Building the advanced model
advanced_model = gl.linear_regression.create(train_data, target='price', features = advanced_features)
In [73]:
## evaluate
print advanced_model.evaluate(test_data)
In [77]:
print advanced_model.evaluate(test_data)
print my_features_model.evaluate(test_data)
In [83]:
print advanced_model.get('coefficients')
In [90]:
plt.plot(test_data['sqft_living'], test_data['price'], 'o',
         test_data['sqft_living'], advanced_model.predict(test_data), '-')
Out[90]:
In [93]:
print house1['price']
print sqft_model.predict(house1)
print my_features_model.predict(house1)
print advanced_model.predict(house1)
In [94]:
## using advanced_model
print advanced_model.predict(gl.SFrame(bill_gates))
In [95]:
## sqft_model
print sqft_model.predict(gl.SFrame(bill_gates))
In [96]:
## my_features_model
print my_features_model.predict(gl.SFrame(bill_gates))